PyPI dependency¶
Contribution statement¶
We all helped each other with the different parts of the project. We all contributed equally to choosing the topic and collecting the data. Later on, Anton took responsibility for the network analysis, Andreas for the textual analysis, and Mathias for the webpage and overall layout. Everyone still worked on every part so we could properly agree on the final conclusions.
GitHub Repository¶
A link to the assignment repository with all the relevant code is available here, while the repository responsible for the webpage deployment can be found here.
Webpage¶
The webpage is built using GitHub Pages and Hugo. You can access the webpage through this link, and a hosted version of this repository's README at this link.
Motivation¶
This project investigates a network constructed from Python packages' dependencies on each other, based on the packages available on the Python Package Index (PyPI). The inspiration for the idea comes from an older dataset on Netzschleuder made by Kevin Gullikson. His original blog post is from 2016 and will therefore serve as a baseline for comparing the networks over time.
We found it interesting to research how the package library in Python has evolved over the years and to further investigate which packages are the hubs all other packages depend on. Creating libraries is a big part of programming, making functionality easier to reuse across files and users, and we hope to gain insight into how these libraries fall into different groups, to see in which fields users most often create new libraries. By also analysing the text available in README files, we hope to learn how programmers formulate themselves when writing documentation for their code, and whether this varies across different groups.
Hopefully the end user will experience a breath of fresh air when reading through our analysis. It is a different and rarely seen type of network, so we would like the user to find our research interesting not only from a computer science point of view but also from a social science angle. While the project is inspired by an interest in how the libraries link together, it provides more insight into the habits of programmers, and it is this that we hope the end user will also learn something about while reading through our website.
Imports¶
# Modules
import nest_asyncio
nest_asyncio.apply()
import json
from bs4 import BeautifulSoup
import re
import xmlrpc.client as xc
from tqdm import tqdm
import pickle
import aiohttp
import asyncio
import networkx as nx
import matplotlib.pyplot as plt
import netwulf as wulf
import numpy as np
import math
import random
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk import bigrams as make_bigrams
from nltk.collocations import BigramCollocationFinder
from nltk.metrics.association import BigramAssocMeasures
from scipy import stats
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
import wordcloud as wc
import ast
image_folder = 'data/images/' # Temporary folder to save and store images in
# Set client to the PyPI XML-RPC server
client = xc.ServerProxy('https://pypi.org/pypi')
# Get a list of all the packages
pypi_packages = client.list_packages()
# lowercase all the package names
pypi_packages = [package.lower() for package in pypi_packages]
# Save the list of packages
with open("data/packages.pkl", "wb") as f:
pickle.dump(pypi_packages, f)
We then define a function that checks each package's PyPI page for a link to its GitHub repository, which is expected to sit under a certain class in the left column. Since it is a link to a website, it will be an a-tag with an href, so we collect all of these and extract any and all GitHub links with a regex. If we find more than one link, we assume the correct one to be the shortest, which is presumptuous, but after looking through a larger handful of pages it was the method we found most successful.
async def get_github_link(packages: list, semaphore: asyncio.Semaphore) -> list:
"""
Function that takes a list of python packages and returns a list of tuples with the package name, the link to the PyPI page and the link to the GitHub page.
packages: list
List of python packages to search for.
semaphore: asyncio.Semaphore
Semaphore to limit the number of concurrent tasks.
return: list
List of tuples with the package name, the link to the PyPI page and the link to the GitHub page.
"""
all_links = []
async with semaphore, aiohttp.ClientSession() as session:
for i, package in enumerate(packages):
# The link to the python package
LINK = f"https://pypi.org/project/{package}/"
# Get the HTML content of the page
try:
async with session.get(LINK) as r:
if r.status != 200:
print(f"Request failed for {package, i}: {r.status}")
continue
content = await r.text()
except Exception as e:
print(f"Request failed for {package, i}: {str(e)}")
continue
# Parse the HTML content of the page
soup = BeautifulSoup(content, "html.parser")
# Get sidebar with links
sidebar = soup.find("div", {"class": "vertical-tabs__tabs"})
# Get all the links in the sidebar
references = [link.get("href") for link in sidebar.find_all("a")]
# Join into one string to regex in
reference_text = " ".join([reference for reference in references if reference is not None])
# Find the first link that contains the word "github.com"
github_links = []
for link in re.finditer(r"github\.com(/\w*|/\W|[-]\w*|[-]\W*)*", reference_text):
if link.group() != "github.com/" and link.group() != "github.com":
github_links.append(link.group())
# If there are no links, append None
if len(github_links) == 0:
github_link = None
# If there's several take the shortest and alert the user
elif len(github_links) > 1:
# print(f"Several GitHub links found for {package, i}: {github_links}")
github_link = min(github_links, key=len)
# If there is just one link, take that out of the list
elif len(github_links) == 1:
github_link = github_links[0]
# Else alert the user no githublink is found
else:
print(f"No GitHub link found for {package}")
github_link = None
# Append the triplet to the list
all_links.append((package, LINK, github_link))
# Sleep for a short period to avoid hitting rate limits
await asyncio.sleep(0.1)
return all_links
We then collect the links and clean up the resulting list to only include tuples where the GitHub link exists.
# Load packages
with open("data/packages.pkl", "rb") as f:
pypi_packages = pickle.load(f)
# Create an event loop
loop = asyncio.get_event_loop()
# Create a semaphore
semaphore = asyncio.Semaphore(500) # Adjust this value as needed
# Create a list to hold the tasks
tasks = []
chunk_size = 80
# For each chunk of packages
for chunk in [pypi_packages[i:i+chunk_size] for i in range(0, len(pypi_packages), chunk_size)]:
# Create a task for the function and add it to the list
task = asyncio.ensure_future(get_github_link(chunk, semaphore))
tasks.append(task)
# Run all the tasks concurrently
all_links = loop.run_until_complete(asyncio.gather(*tasks))
# Flatten the list of lists into a single list
all_links = [link for sublist in all_links for link in sublist]
# Clean the list of None links
all_links = [(p, l, g) for p, l, g in all_links if g is not None]
# Save the list of links
with open("data/all_links_github.json", "w") as f:
json.dump(all_links, f)
with open("data/packages.pkl", "rb") as f:
pypi_packages = pickle.load(f)
print("Number of packages on pypi:", len(pypi_packages))
with open('data/all_links_github.json', 'r') as f:
data = json.load(f)
print("Number of packages to successfully get the github link from:", len(data))
Number of packages on pypi: 533827 Number of packages to successfully get the github link from: 347955
The number of packages drastically shrinks from the initial 530k+ to almost 350k. While one would normally assume public code like packages to be well documented, packages come in all sizes, and small packages used by only a handful of people within a certain area might not put documentation up online if it is available internally. It is also possible that not every package linked to its repository in the same way as the majority, so we lost some because of the choices we had to make on how to scrape the data.
Following this, we then have to collect the information off the GitHub repository. We want to collect both any README file and the dependencies the package has. This is done based on another assumption: after investigating several repositories, we found that the content files of a repo have their own URL structure of the format https://raw.githubusercontent.com/<user>/<repo>/<branch>/<file>. Therefore, we replace github.com with the raw URL and simply try different combinations of branch and file names. These combinations are chosen based on the naming conventions we could find on different small and big repositories, including numpy, pandas, matplotlib and pytorch. A branch was named either main or master, and the README files ended with .md, .rst or .txt. The dependencies had one of the following naming conventions:
- requirements-dev.txt
- dev-requirements.txt
- environment.yml
- pyproject.toml
- requirements.txt
If none of the combinations were successful for the dependencies, we discarded the package, as we then couldn't link it properly in the network. However, if it was only the README file we couldn't find, we still kept the package for the network analysis and only dropped it later for the textual analysis. We did this to stay as true to the network as possible.
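The branch and file-name combinations described above can be sketched as follows (the helper name and the example repository path are ours, purely for illustration):

```python
from itertools import product

# Branch and file-name conventions we assume a repository may use
BRANCHES = ["main", "master"]
README_FILES = ["README.md", "README.rst", "README.txt"]

def candidate_raw_urls(github_link: str, filenames: list) -> list:
    """Build the raw.githubusercontent.com URLs to try, in order."""
    repo_path = github_link.replace("github.com/", "").strip("/")
    return [
        f"https://raw.githubusercontent.com/{repo_path}/{branch}/{name}"
        for branch, name in product(BRANCHES, filenames)
    ]

urls = candidate_raw_urls("github.com/numpy/numpy", README_FILES)
print(len(urls))  # 2 branches x 3 readme names = 6 candidates to request
```

The first URL that answers with status 200 wins, which is exactly the loop structure the function below uses.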
async def node_creator(packages: list, semaphore: asyncio.Semaphore) -> list:
"""
Function that takes a list of python packages and returns a list of nodes with the package name, readme text, and requirements text.
packages: list
List of python packages to search for.
semaphore: asyncio.Semaphore
Semaphore to limit the number of concurrent tasks.
return: list
List of nodes with the package name, readme text, and requirements text.
"""
# For each package go to the GitHub page and get the readme.text if theres a README.md
async def get_readme_text(session, github_link: str) -> list:
"""
Function that takes a GitHub link and returns the text of the README.md file.
github_link: str
Link to the GitHub page.
return: str
Text of the README.md file.
"""
# If there's no link, return None
if github_link is None:
return None
readme_text = None
readme_file_extensions = ['README.md', 'README.rst', 'README.txt']
for extension in readme_file_extensions:
try:
async with session.get(f"{github_link}/main/{extension}") as response:
if response.status == 200:
readme_text = await response.text(errors='ignore')
break # Exit the loop if the readme file is found
async with session.get(f"{github_link}/master/{extension}") as response:
if response.status == 200:
readme_text = await response.text(errors='ignore')
break # Exit the loop if the readme file is found
except Exception as e:
print(e)
return None
# If no branch/extension combination yielded a readme, give up
if readme_text is None:
return None
readme_text = BeautifulSoup(readme_text, "html.parser")
# Clean the text using regex
readme_text = re.sub(r"<.*?>", "", readme_text.text) # Remove HTML tags
readme_text = re.sub(r'```.*?```', '', readme_text, flags=re.DOTALL) # Remove code blocks
readme_text = re.sub(r"http.*", "", readme_text) # Remove links which start with http
readme_text = re.sub(r"\.\./\S*|\./\S*|/\S*", "", readme_text) # Remove links to files in the repository which start with / or ./ or ../
readme_text = re.sub(r"\n", " ", readme_text) # Convert \n to space
readme_text = readme_text.lower() # Make all text lowercase
readme_text = re.sub(r"[^a-z0-9-_ ]", "", readme_text) # Only keep Alphanumeric characters and - and _
readme_text = re.sub(r" +", " ", readme_text) # Remove multiple spaces
readme_text = [line for line in readme_text.split(" ") if line != ""] # Remove empty strings
return readme_text
async def get_requirements_text(session, github_link):
"""
Function that takes a GitHub link and returns the text of the requirements.txt file.
github_link: str
Link to the GitHub page.
return: str
Text of the requirements.txt file.
"""
# If there's no link, return None
if github_link is None:
return None
txt_bool = True
pyproject_bool = False
requirements_file_extensions = ['requirements-dev.txt', 'dev-requirements.txt',
'environment.yml', 'pyproject.toml', 'requirements.txt']
requirements_text = None
for extension in requirements_file_extensions:
try:
async with session.get(f"{github_link}/main/{extension}") as response:
if response.status == 200:
requirements_text = await response.text(errors='ignore')
break # Exit the loop if the requirements file is found
async with session.get(f"{github_link}/master/{extension}") as response:
if response.status == 200:
requirements_text = await response.text(errors='ignore')
break # Exit the loop if the requirements file is found
except Exception as e:
print(e)
return None
if requirements_text is None:
return None
if extension[-4:] == '.txt':
txt_bool = True
pyproject_bool = False
elif extension[-4:] == '.yml':
txt_bool = False
pyproject_bool = False
elif extension[-5:] == '.toml':
txt_bool = False
pyproject_bool = True
# Clean the text using regex
cleaning_reg = r"=.*|>.*|~.*|\[.*\]|;.*|<.*|!.*"
if txt_bool:
# Example:
# versioneer[toml]
# cython~=3.0.5
# meson[ninja]==1.2.1
# meson-python==0.13.1
# pytest-qt>=4.2.0
# We only want the package name and not the version or extras
requirements_text = re.sub(r"\[.*\]", "", requirements_text)
requirements_text = re.sub(r"#.*", "", requirements_text) # Remove comments
requirements_text = re.sub(f"{cleaning_reg}", "", requirements_text) # Clean the text using regex
requirements_text = requirements_text.lower() # Lower case
requirements_text = requirements_text.split("\n") # Convert to list
requirements_text = [requirement.strip() for requirement in requirements_text] # Remove trailing spaces
requirements_text = [requirement for requirement in requirements_text if requirement != ""] # Remove empty strings
elif pyproject_bool:
# Example:
# [project]
# name = "pydata-sphinx-theme"
# description = "Bootstrap-based Sphinx theme from the PyData community"
# requires-python = ">=3.9"
# dependencies = [
# "Babel",
# "pygments>=2.7",
# ]
# [project.optional-dependencies]
# doc = [
# "numpydoc",
# "linkify-it-py", # for link shortening
# ]
requirements_text = re.sub(r"#.*", "", requirements_text) # Remove comments
dependencies = re.findall(r'dependencies = \[\n(.*?)\n\]', requirements_text, re.DOTALL)
optional_dependencies = re.findall(r'optional-dependencies\]\n.*? = \[\n(.*?)\n\]', requirements_text, re.DOTALL)
if len(dependencies) == 0 and len(optional_dependencies) == 0:
return None
if len(dependencies) == 0:
dependencies = [""]
if len(optional_dependencies) == 0:
optional_dependencies = [""]
# The names are either in '' or in ""
dependencies = re.findall(r'".*?"', dependencies[0]) + re.findall(r"'.*?'", dependencies[0])
optional_dependencies = re.findall(r'".*?"', optional_dependencies[0]) + re.findall(r"'.*?'", optional_dependencies[0])
requirements_text = dependencies + optional_dependencies
requirements_text = [requirement[1:-1] for requirement in requirements_text] # Remove double quotes
requirements_text = [re.sub(f"{cleaning_reg}", "", requirement) for requirement in requirements_text] # Clean the text using regex
requirements_text = [requirement.lower() for requirement in requirements_text] # Lower case
requirements_text = [requirement.strip() for requirement in requirements_text] # Remove trailing spaces
requirements_text = [requirement for requirement in requirements_text if requirement != ""] # Remove empty strings
else:
# Example:
# name: myenv
# channels:
# - defaults
# dependencies:
# - numpy
# - pip
# - pip:
# - matplotlib
requirements_text = re.sub(r"#.*", "", requirements_text) # Remove comments
requirements_text = re.findall(r"- .*", requirements_text) # Only get the dependencies which start with '- '
requirements_text = [re.sub(f"{cleaning_reg}", "", requirement) for requirement in requirements_text] # Clean the text using regex
requirements_text = [requirement.lower() for requirement in requirements_text] # Lower case
requirements_text = [requirement[2:] for requirement in requirements_text] # Convert to list
requirements_text = [requirement.strip() for requirement in requirements_text] # Remove trailing spaces
requirements_text = [requirement for requirement in requirements_text if requirement != ""] # Remove empty strings
return requirements_text
async with aiohttp.ClientSession('https://raw.githubusercontent.com/') as session:
nodes = []
for i, package in enumerate(packages):
try:
# The link to the python package
package_name, _, github_url = package
github_url = github_url.replace("github.com", "")
async with semaphore:
readme_text = await get_readme_text(session, github_url)
requirements_text = await get_requirements_text(session, github_url)
# Append the node to the list
nodes.append((package_name, readme_text, requirements_text))
except Exception as e:
print(f"Failed to get node for {package_name, i}: {str(e)}")
continue
return nodes
As seen from the above function, we run the data collection asynchronously, as this greatly speeds up the collection: we make multiple requests to GitHub where we are I/O bound waiting for the downloads, and we need to fetch both the README file and the dependency list.
The next cell shows an example of how the collection can be run and what the result looks like. In this case we use the package numpy as an example.
# Example to show how a node gets generated
package_to_find = "numpy"
# Load clean data
with open('data/all_links_github.json', 'r') as f:
data = json.load(f)
semaphore = asyncio.Semaphore(500)
# Test the function
for package in data:
if package[0] == package_to_find:
test_data = package
break
# Create an event loop
loop = asyncio.get_event_loop()
# Create a list to hold the tasks
tasks = []
# Create a task for the function and add it to the list
task = asyncio.ensure_future(node_creator([test_data], semaphore))
# Run all the tasks concurrently
node = loop.run_until_complete(task)
print(node)
[('numpy', ['powered', 'by', 'numfocus', 'pypi', 'downloads', 'conda', 'downloads', 'stack', 'overflow', 'nature', 'paper', 'openssf', 'scorecard', 'numpy', 'is', 'the', 'fundamental', 'package', 'for', 'scientific', 'computing', 'with', 'python', '-', 'website', '-', 'documentation', '-', 'mailing', 'list', '-', 'source', 'code', '-', 'contributing', '-', 'bug', 'reports', '-', 'report', 'a', 'security', 'vulnerability', 'it', 'provides', '-', 'a', 'powerful', 'n-dimensional', 'array', 'object', '-', 'sophisticated', 'broadcasting', 'functions', '-', 'tools', 'for', 'integrating', '-', 'useful', 'linear', 'algebra', 'fourier', 'transform', 'and', 'random', 'number', 'capabilities', 'testing', 'numpy', 'requires', 'pytest', 'and', 'hypothesis', 'tests', 'can', 'then', 'be', 'run', 'after', 'installation', 'with', 'python', '-c', 'import', 'numpy', 'sys', 'sysexitnumpytest', 'is', 'false', 'code', 'of', 'conduct', '----------------------', 'numpy', 'is', 'a', 'community-driven', 'open', 'source', 'project', 'developed', 'by', 'a', 'diverse', 'group', 'of', 'contributors', 'commitment', 'to', 'creating', 'an', 'open', 'inclusive', 'and', 'positive', 'community', 'please', 'read', 'the', 'numpy', 'code', 'of', 'conduct', 'with', 'others', 'in', 'a', 'way', 'that', 'makes', 'our', 'community', 'thrive', 'call', 'for', 'contributions', '----------------------', 'the', 'numpy', 'project', 'welcomes', 'your', 'expertise', 'and', 'enthusiasm', 'small', 'improvements', 'or', 'fixes', 'are', 'always', 'appreciated', 'if', 'you', 'are', 'considering', 'larger', 'contributions', 'to', 'the', 'source', 'code', 'please', 'contact', 'us', 'through', 'the', 'mailing', 'list', 'writing', 'code', 'isnt', 'the', 'only', 'way', 'to', 'contribute', 'to', 'numpy', 'you', 'can', 'also', '-', 'review', 'pull', 'requests', '-', 'help', 'us', 'stay', 'on', 'top', 'of', 'new', 'and', 'old', 'issues', '-', 'develop', 'tutorials', 'presentations', 'and', 'other', 'educational', 'materials', 
'-', 'maintain', 'and', 'improve', 'our', 'website', '-', 'develop', 'graphic', 'design', 'for', 'our', 'brand', 'assets', 'and', 'promotional', 'materials', '-', 'translate', 'website', 'content', '-', 'help', 'with', 'outreach', 'and', 'onboard', 'new', 'contributors', '-', 'write', 'grant', 'proposals', 'and', 'help', 'with', 'other', 'fundraising', 'efforts', 'for', 'more', 'information', 'about', 'the', 'ways', 'you', 'can', 'contribute', 'to', 'numpy', 'visit', 'our', 'website', 'if', 'youre', 'unsure', 'where', 'to', 'start', 'or', 'how', 'your', 'skills', 'fit', 'in', 'reach', 'out', 'you', 'can', 'ask', 'on', 'the', 'mailing', 'list', 'or', 'here', 'on', 'github', 'by', 'opening', 'a', 'new', 'issue', 'or', 'leaving', 'a', 'comment', 'on', 'a', 'relevant', 'issue', 'that', 'is', 'already', 'open', 'our', 'preferred', 'channels', 'of', 'communication', 'are', 'all', 'public', 'but', 'if', 'youd', 'like', 'to', 'speak', 'to', 'us', 'in', 'private', 'first', 'contact', 'our', 'community', 'coordinators', 'at', 'numpy-teamgooglegroupscom', 'or', 'on', 'slack', 'write', 'numpy-teamgooglegroupscom', 'for', 'an', 'invitation', 'we', 'also', 'have', 'a', 'biweekly', 'community', 'call', 'details', 'of', 'which', 'are', 'announced', 'on', 'the', 'mailing', 'list', 'you', 'are', 'very', 'welcome', 'to', 'join', 'if', 'you', 'are', 'new', 'to', 'contributing', 'to', 'open', 'source', 'this', 'guide', 'and', 'how', 'to', 'successfully', 'get', 'involved'], ['conda-forge', 'python', 'cython', 'compilers', 'openblas', 'nomkl', 'setuptools', 'ninja', 'pkg-config', 'meson-python', 'pip', 'spin', 'ccache', 'pytest', 'pytest-cov', 'pytest-xdist', 'hypothesis', 'typing_extensions', 'mypy', 'sphinx', 'sphinx-design', 'numpydoc', 'ipython', 'scipy', 'pandas', 'matplotlib', 'pydata-sphinx-theme', 'doxygen', 'breathe', 'pycodestyle', 'gitpython', 'cffi', 'pytz'])]
Below we run the full collection process:
(Notice that we effectively end up not chunking several requests into the same session, as we experienced issues with sending too many requests to the server.)
# Run the function with asyncio to get all the nodes
# Function to divide a list into chunks
def chunks(lst, n):
"""Yield successive n-sized chunks from lst."""
for i in range(0, len(lst), n):
yield lst[i:i + n]
async def process_chunks(data, semaphore):
nodes = []
for chunk in chunks(data, 1):
task = asyncio.ensure_future(node_creator(chunk, semaphore))
nodes.append(task)
# Run all tasks concurrently
nodes = await asyncio.gather(*nodes)
return nodes
async def main():
# Create a semaphore with a limit of 50 concurrent tasks
semaphore = asyncio.Semaphore(50)
# Process data in chunks
nodes = await process_chunks(data, semaphore)
# Flatten the list of lists into a single list
nodes = [node for sublist in nodes for node in sublist]
# Save the list to a json file
with open("data/nodes.json", "w") as f:
json.dump(nodes, f)
# Assuming you have an event loop already created
loop = asyncio.get_event_loop()
# Run the main coroutine
loop.run_until_complete(main())
Data Preprocessing¶
We make the edge list by first getting the names of all the packages in the network. This is used as a reference point for whether a requirement should be added to the network, as we only want pairs where both nodes are packages we have collected dependencies for.
# Now we make the edgelist
# Load the data
with open('data/nodes.json', 'r') as f:
nodes = json.load(f)
edge_list = []
extra_packages = set()
pypi_packages = set(pypi_packages) # Use a set for constant-time membership checks
for node in tqdm(nodes):
if node[2] is None:
continue
for package in node[2]:
edge_list.append((node[0], package))
if package not in pypi_packages:
extra_packages.add(package)
continue
print(f"Number of packages not on pypi: {len(extra_packages)}")
# Save the edge list to a pickle file and csv for visualization
with open("data/edge_list.pkl", "wb") as f:
pickle.dump(edge_list, f)
df = pd.DataFrame(edge_list, columns=["source", "target"])
df.to_csv("data/edge_list.csv", index=False, escapechar="\\")
100%|██████████| 347955/347955 [18:38<00:00, 311.23it/s]
Number of packages not on pypi: 3694
We find that a total of 3,694 packages don't get added to the network. This is likely due to us including conda environment files, as libraries exclusive to e.g. conda-forge or Linux can't be found on PyPI.
Afterwards we also turn the list of nodes into a single dictionary mapping each package to its information. Additionally, we join the README text from a list of words back into one string before making the dataframe, as we would otherwise be converting the list itself into a string, so joining the text at this point made the most sense.
# Convert the list of to a dictionary
nodes_mapping = {}
for node in nodes:
nodes_mapping[node[0]] = {"readme_text": node[1], "requirements_text": node[2]}
# Join the readme text into one string
for package in nodes_mapping.keys():
text = nodes_mapping[package]["readme_text"]
result = " ".join(text) if text is not None else ""
if result:
nodes_mapping[package]["readme_text"] = result
# Save the dictionary to a csv file
df = pd.DataFrame(nodes_mapping).T
df.index.name = "package"
df.reset_index(inplace=True)
df.to_csv("data/nodes_data.csv", index=False)
Data Description¶
After having collected and structured the data, we can get an overview of the amount of data we have for the different analyses we will be making.
data_set = pd.read_csv("data/nodes_data.csv")
print(f'The amount of nodes with requirements: {data_set["requirements_text"].notnull().sum()}')
print(f'The amount of nodes with readme: {data_set["readme_text"].notnull().sum()}')
print(f'The amount of nodes with both requirements and readme: {data_set[data_set["requirements_text"].notnull() & data_set["readme_text"].notnull()].shape[0]}')
print('-----------------------------------')
print(f'The amount of nodes with only requirements: {data_set[data_set["requirements_text"].notnull() & data_set["readme_text"].isnull()].shape[0]}')
print(f'The amount of nodes with only readme: {data_set[data_set["requirements_text"].isnull() & data_set["readme_text"].notnull()].shape[0]}')
print(f'The amount of nodes with neither requirements nor readme: {data_set[data_set["requirements_text"].isnull() & data_set["readme_text"].isnull()].shape[0]}')
The amount of nodes with requirements: 19250 The amount of nodes with readme: 48076 The amount of nodes with both requirements and readme: 17631 ----------------------------------- The amount of nodes with only requirements: 1619 The amount of nodes with only readme: 30445 The amount of nodes with neither requirements nor readme: 298260
As seen from the above output, the retrieval success for READMEs and requirements was very low. A reasonable explanation is the combination of the approach taken to collect the data (which has its faults) and the fact that some share of projects is simply unlikely to have a README or a requirements file at all. In the end there are 17,631 'full' data points. However, for the textual analysis we are only concerned with data points that have a README file, giving 48,076 data points.
Further, we can build a network from the edge list and investigate the number of nodes, links and degrees.
# Load the edge list
with open("data/edge_list.pkl", "rb") as f:
edge_list = pickle.load(f)
# Create a directed graph and add the edges
G = nx.DiGraph()
G.add_edges_from(edge_list)
print("Number of nodes:", G.number_of_nodes())
print("Number of edges:", G.number_of_edges())
print("Number of weakly connected components:", nx.number_weakly_connected_components(G))
Number of nodes: 32396 Number of edges: 150305 Number of weakly connected components: 347
We see here that the actual number of nodes in the network is 32,396, compared to the 347,955 packages we made the edge list from. First off, we know that only 19,250 actually had a requirements file; the additional nodes come from the dependencies these repositories have, since we allowed packages from PyPI to be added back in if they were a dependency of another package.
# Get the number of nodes in the connected components
number_nodes_cc = sorted([len(component) for component in nx.weakly_connected_components(G)], reverse=True)
# Get the largest connected component
largest_cc = max(nx.weakly_connected_components(G), key=len)
G = G.subgraph(largest_cc).copy()
print("Number of nodes in the largest weakly connected component:", G.number_of_nodes())
print("Number of nodes in the following connected components:", ", ".join(str(np.unique(number_nodes_cc[1:]))[1:-1].split()) + ".")
Number of nodes in the largest weakly connected component: 31304 Number of nodes in the following connected components: 1, 2, 3, 4, 5, 6, 7, 8, 9, 13, 16, 17, 19, 21, 22, 25, 37, 83.
Above we can see that the size of the largest weakly connected component is 31,304, not drastically different from the initial network. Seeing that the sizes of the other 347 weakly connected components all lie between 1 and 83, it is only small groups of nodes that are removed from the network, which wouldn't add integral information to the analysis: a weakly connected component is defined by being able to traverse from any node in the set to any other when edge directions are ignored, so these small groups aren't connected in any way to the main network.
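To make the notion concrete, here is a toy example (not our data) of how weakly connected components behave in a directed graph:

```python
import networkx as nx

# Toy directed graph: a -> b <- c plus a separate pair d -> e
G_toy = nx.DiGraph([("a", "b"), ("c", "b"), ("d", "e")])

weak = sorted(len(c) for c in nx.weakly_connected_components(G_toy))
strong = sorted(len(c) for c in nx.strongly_connected_components(G_toy))

# Ignoring direction, {a, b, c} hang together and so do {d, e},
# but no node can reach another and come back, so all strongly
# connected components are singletons
print(weak)    # [2, 3]
print(strong)  # [1, 1, 1, 1, 1]
```

This is why we use the weak notion for the dependency network: package a and package c both depending on b is enough to consider them part of the same component.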
# Network density
density = nx.density(G)
print(f'Network density: {density:.5f}')
# Network connection between nodes
is_connected = nx.is_weakly_connected(G)
print(f'Is the network connected: {is_connected}')
# Number of isolated nodes
isolated_nodes = list(nx.isolates(G))
print(f'Number of isolated nodes: {len(isolated_nodes)}')
# Degree
degree = dict(G.degree())
degree_values = list(degree.values())
# Average
avg_degree = np.mean(degree_values)
# Median
median_degree = np.median(degree_values)
# Mode
mode_degree = max(set(degree_values), key=degree_values.count)
# Minimum
min_degree = min(degree_values)
# Maximum
max_degree = max(degree_values)
print(f'Degree Analysis:')
print(f'Average: {avg_degree:.2f}')
print(f'Median: {median_degree}')
print(f'Mode: {mode_degree}')
print(f'Minimum: {min_degree}')
print(f'Maximum: {max_degree}')
Network density: 0.00015 Is the network connected: True Number of isolated nodes: 0 Degree Analysis: Average: 9.55 Median: 2.0 Mode: 1 Minimum: 1 Maximum: 3527
From the above information we can begin to understand the network a bit more. It is a large network whose node degrees vary a lot, which keeps the density very low. It is of course a completely connected network, as it is already the subgraph of the initial network induced by the largest weakly connected component. Additionally, the average number of dependencies a package has is 9 to 10, while most packages have just 1 connection and a few packages are large hubs with many thousands of connections.
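The gap between the average (9.55) and the median (2) is typical of a heavy-tailed degree sequence, where a few hubs pull the mean up without moving the median. A small illustration with made-up degrees:

```python
import numpy as np

# Hypothetical degree sequence: mostly leaves plus a single hub
degrees = [1, 1, 1, 1, 2, 2, 3, 100]

print(np.mean(degrees))    # 13.875 -- pulled up by the hub
print(np.median(degrees))  # 1.5   -- barely affected by the hub
```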
Network plot¶
By using an external tool called Cosmograph we were able to get a sensible and manageable plot of the entire network made from the edge list. Here we can see how groups have formed in the network, and thereby why it makes sense that the largest component encompasses almost all nodes in the network. The color of a node is scaled to its total number of links, and we can see how the majority of the nodes are collected in the center of the big cluster. Here packages such as requests, numpy, pandas and more can be found.
Network Analysis¶
We're very interested in analysing our network and determining whether it has any sort of meaningful structure. The best way to do this is using statistics: we get the necessary baseline by generating random networks, and can then see if there's a significant difference.
Here we build a random model to see whether the PyPi network has any structure or whether it more likely consists of random relationships. We have included our original implementation of the random network generation, but NetworkX's implementation is far more optimised, reducing the computation time from 25 minutes to under a second, so we use that instead.
# Make a random network to compare
# Calculate the number of edges in the network
L = G.number_of_edges()
# Calculate the number of nodes in the network
N = G.number_of_nodes()
# Calculate the average degree of the network
k = 2*L/N
print(f"Average degree of the real network: {k:.2f}")
# Calculate the probability p
p = k/(N-1)
print(f"Probability of a link between two nodes: {p:.2e}")
# The Erdős–Rényi model is a random network model where the probability of a link between two nodes is constant and equal to p.
def generate_random_network(N, p):
    """
    Generates a random network with N nodes and edges between nodes with probability p.

    Parameters:
    N (int): The number of nodes in the network.
    p (float): The probability of adding an edge between two nodes.

    Returns:
    G (networkx.Graph): The generated random network.
    """
    # # Create an empty graph
    # G = nx.Graph()
    # # Add N nodes
    # G.add_nodes_from(range(N))
    # # Add edges between nodes with probability p
    # for i in tqdm(range(N-1)):
    #     for j in range(i+1, N):
    #         if np.random.uniform() < p:
    #             G.add_edge(i, j)
    G = nx.generators.fast_gnp_random_graph(N, p)
    return G
G_rand = generate_random_network(N, p)
Average degree of the real network: 9.55 Probability of a link between two nodes: 3.05e-04
We see that on average a node in the real network has 9.55 edges (in- and out-degree combined).
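To make the degree bookkeeping concrete, here is a tiny hand-built example (the package names are made up for illustration) showing how the total average degree 2L/N relates to the average out-degree L/N in a directed dependency network:

```python
# Toy directed dependency edges: each edge points from a package to one of
# its dependencies (hypothetical names, for illustration only)
edges = [("pkgA", "numpy"), ("pkgA", "requests"), ("pkgB", "numpy")]
nodes = {n for edge in edges for n in edge}
L, N = len(edges), len(nodes)  # L = 3 edges, N = 4 nodes
# Every edge adds one out-degree and one in-degree somewhere, so the
# mean total degree is 2L/N while the mean out-degree is only L/N
print(2 * L / N)  # 1.5  (mean total degree)
print(L / N)      # 0.75 (mean out-degree = mean in-degree)
```

So the 9.55 reported above counts both directions; the average number of direct dependencies per package is half of it.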
print(f'Log of N: {math.log(N):.4}')
print(f"Average degree of the random network: {2*G_rand.number_of_edges()/G_rand.number_of_nodes():.2f}")
Log of N: 10.35 Average degree of the random network: 9.52
We see that the average degree of a node in both the real and the random network is above 1, while log(N) is higher than the average degree. This means that both networks fall into the supercritical regime. However, we would still expect the real network to have a different degree distribution than a random network; otherwise the real network would be very close to an Erdős–Rényi randomized network. This is tested below:
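The regime boundaries for an Erdős–Rényi network (⟨k⟩ < 1 subcritical, 1 < ⟨k⟩ < ln N supercritical, ⟨k⟩ > ln N connected) can be written as a small helper; the function name below is our own, not part of any library:

```python
import math

def er_regime(avg_k: float, n: int) -> str:
    """Classify an Erdos-Renyi random network by its average degree."""
    if avg_k < 1:
        return "subcritical"    # only small components exist
    if avg_k < math.log(n):
        return "supercritical"  # a giant component exists, but isolated nodes remain
    return "connected"          # the network is almost surely fully connected

# Our network: <k> = 9.55 with log(N) = 10.35, i.e. roughly 31,000 nodes
print(er_regime(9.55, 31_000))  # supercritical
```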
# Make a plot of the degree distribution
def make_plot(bins: list[list[int]], hists: list[list[int]], title: str, labels: list[str]):
    """
    Plot multiple histograms on a logarithmic scale.

    Parameters:
    bins (list[list[int]]): A list of bin edges for each histogram.
    hists (list[list[int]]): A list of histogram values for each bin.
    title (str): The title of the plot.
    labels (list[str]): A list of labels for each histogram.

    Returns:
    None
    """
    for bin, hist, label in zip(bins, hists, labels):
        plt.plot(bin[:-1], hist, label=label)
    plt.xscale('log')
    plt.yscale('log')
    plt.xlabel('Degree')
    plt.ylabel('Density')
    plt.title(title)
    plt.legend()
    plt.savefig(f'{image_folder}Degree_distribution.png', bbox_inches='tight')
    plt.show()
# Compute the distribution of degree for the random network
degree_sequence_rd = [d for _, d in G_rand.degree()]
hist_rd, bins_rd = np.histogram(degree_sequence_rd, bins=np.logspace(0, 2.2, 13), density=True)
# Compute the distribution of degree for the real network
degree_sequence = [d for _, d in G.degree()]
hist_r, bins_r = np.histogram(degree_sequence, bins=np.logspace(0, 2.2, 13), density=True)
plt.axvline(k, color='black', linestyle='-', label='Real network average degree')
plt.axvline(np.mean(degree_sequence_rd), color='red', linestyle='--', label='Random network average degree')
make_plot([bins_rd, bins_r], [hist_rd, hist_r], 'Degree distribution of the real and random network', ['Random network', 'Real network'])
We see that the real network has a distinctively different degree distribution from the random network. Specifically, the real network clearly follows a power-law distribution: there is a large number of nodes with only a few connections, and there exist 'hubs' with a degree much higher than $\langle k \rangle$, which is characteristic of a heavy-tailed distribution. The random network behaves quite differently, dropping towards zero much faster, showing that it lacks the extreme values the real network has. Even though the random network was correctly based on the real network (their average degrees match perfectly), its degrees are concentrated around the mean instead of spreading out into a heavy tail.
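One way to quantify such a heavy tail is to estimate the power-law exponent directly from a degree sequence. Below is a minimal sketch of the continuous maximum-likelihood estimator $\hat{\alpha} = 1 + n / \sum_i \ln(k_i / k_{min})$; we did not run it on the PyPi data here, and the synthetic degrees only serve to illustrate the estimator:

```python
import math
import random

def powerlaw_alpha_mle(degrees, k_min=1.0):
    """Continuous MLE for the power-law exponent: alpha = 1 + n / sum(ln(k / k_min))."""
    ks = [k for k in degrees if k >= k_min]
    return 1.0 + len(ks) / sum(math.log(k / k_min) for k in ks)

# Sanity check on synthetic power-law samples drawn by inverse-transform
# sampling: k = k_min * (1 - u)^(-1 / (alpha - 1)) with alpha = 2.5
random.seed(42)
samples = [(1 - random.random()) ** (-1 / 1.5) for _ in range(100_000)]
print(round(powerlaw_alpha_mle(samples), 2))  # close to 2.5
```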
We can still investigate how these hubs are interconnected by calculating the assortativity of the nodes. First we turn the graph into an undirected network to calculate the assortativity:
def degree_assortativity(G: nx.Graph) -> float:
    """
    Calculate the degree assortativity coefficient of a graph.

    The degree assortativity coefficient measures the correlation between the degrees of connected nodes in a graph.
    It quantifies the tendency of nodes with similar degrees to be connected to each other.

    Parameters:
    - G (networkx.Graph): The input graph.

    Returns:
    - r (float): The degree assortativity coefficient of the graph.

    Prerequisites:
    - The input graph G should be an instance of the networkx.Graph class.
    - The graph should have at least one edge.
    """
    # Collect the degrees at both ends of every edge, in both orders,
    # so the correlation is symmetric in u and v
    k_u = []
    k_v = []
    for u, v in G.edges():
        k_u.append(G.degree(u))
        k_v.append(G.degree(v))
    for x, y in G.edges():
        k_u.append(G.degree(y))
        k_v.append(G.degree(x))
    k_u = np.array(k_u)
    k_v = np.array(k_v)
    # Pearson correlation coefficient of the two degree sequences
    r = ((np.mean(k_u * k_v) - np.mean(k_u) * np.mean(k_v)) /
         (np.sqrt(np.mean(k_u**2) - np.mean(k_u)**2) * np.sqrt(np.mean(k_v**2) - np.mean(k_v)**2)))
    return r
r_degree = degree_assortativity(G.to_undirected())
print(f"Degree assortativity of the real network: {r_degree:.5f}")
Degree assortativity of the real network: -0.18327
We see that the network is disassortative, meaning that nodes with high degree generally link to nodes with low degree.
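The Pearson-on-edges calculation can be sanity-checked on a graph whose assortativity is known in advance: a star graph is maximally disassortative (every edge joins the hub to a degree-1 leaf), so the coefficient should be exactly -1. A self-contained sketch of the same computation, using only the standard library:

```python
from collections import defaultdict

def edge_degree_assortativity(edge_list):
    """Pearson correlation of the degrees at either end of each undirected edge."""
    deg = defaultdict(int)
    for u, v in edge_list:
        deg[u] += 1
        deg[v] += 1
    # Count each edge in both directions so the measure is symmetric
    xs, ys = [], []
    for u, v in edge_list:
        xs += [deg[u], deg[v]]
        ys += [deg[v], deg[u]]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = (sum((x - mx) ** 2 for x in xs) / n) ** 0.5
    sy = (sum((y - my) ** 2 for y in ys) / n) ** 0.5
    return cov / (sx * sy)

star = [(0, leaf) for leaf in range(1, 6)]  # hub 0 with five leaves
print(edge_degree_assortativity(star))      # -1.0
```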
To have some measure to compare this disassortativity against, we make a random model based on a configuration model with the following properties:
- It should not have multilinks
- It should not have self-loops
This forces the graph to be simple, and it will therefore develop structural disassortativity. This is something we want, as we want to test whether the disassortativity displayed by the PyPi network is significant or simply a product of the fact that a package cannot have itself as a dependency and cannot depend several times on the same package; it either depends on a package or not.
# Configuration model: random network with a pre-defined degree sequence
def configuration_model(G: nx.Graph) -> nx.Graph:
    """
    Generates a configuration model graph based on the input graph G.

    The configuration model is a random graph model that preserves the degree sequence of the input graph.
    It generates a new graph by randomly rewiring edges while maintaining the same degree distribution.

    Parameters:
    - G (nx.Graph): The input graph.

    Returns:
    - nx.Graph: The generated configuration model graph.

    Prerequisites:
    - The input graph G should be an instance of the networkx Graph class.
    - The input graph G should have at least one edge.
    """
    G_copy = G.copy()
    edges = list(G_copy.edges())
    idxs = list(range(len(edges)))
    num_swaps = 10 * G_copy.number_of_edges()
    for _ in range(num_swaps):
        # Select two edges
        idx1, idx2 = random.sample(idxs, 2)
        e1, e2 = edges[idx1], edges[idx2]
        # Flip the direction of e1 50% of the time
        if random.random() < 0.5:
            e1 = (e1[1], e1[0])
        # Ensure the four endpoints are distinct, so the swap cannot create self-loops
        if len({e1[0], e1[1], e2[0], e2[1]}) < 4:
            continue
        # Ensure the new edges do not already exist (no multilinks)
        if e1[0] not in G_copy.neighbors(e2[1]) and e2[0] not in G_copy.neighbors(e1[1]):
            # Remove old edges and add new edges
            G_copy.remove_edges_from([e1, e2])
            G_copy.add_edges_from([(e1[0], e2[1]), (e2[0], e1[1])])
            edges[idx1] = (e1[0], e2[1])
            edges[idx2] = (e2[0], e1[1])
    return G_copy
G_undir = G.to_undirected()
G_config = configuration_model(G_undir)
# Assert that the degree sequence of the configuration model is the same as the real network
assert all([G_undir.degree(node) == G_config.degree(node) for node in G_undir.nodes])
In the end we check that all the nodes in the two networks have the same degree, thereby confirming the algorithm is working as expected. Now we will generate 20 different random networks with the configuration model to see if they're significantly different from the PyPi network:
# Calculate the degree assortativity of multiple configuration models
assortativities = []
for _ in tqdm(range(20)):
    G_config = configuration_model(G_undir)
    r_config = degree_assortativity(G_config)
    assortativities.append(r_config)
# Plot the distribution of the assortativities
plt.hist(assortativities, bins=5, label='Random networks')
# Plot the assortativity of the original network
plt.axvline(r_degree, color='red', linestyle='--', label='Real network')
plt.xlabel('Assortativity')
plt.ylabel('Frequency')
plt.title('Assortativity of the Configuration model network')
plt.legend()
plt.savefig(f'{image_folder}Assortativity.png', bbox_inches='tight')
plt.show()
100%|ββββββββββ| 20/20 [19:47<00:00, 59.37s/it]
We see that the real network has a lower (more negative) assortativity coefficient than the random networks. This indicates that the disassortativity of the PyPi network is not just due to structural disassortativity; however, this test was a simplification for the sake of understanding.
A much better test would be to let the PyPi network keep its directed property and calculate the assortativity based on direction and make a new configuration model algorithm with the added requirement:
- It should let the in- and out-degree stay the same during node swapping
Furthermore, we need to change the assortativity calculation to account for pairs of connected nodes (node A and node B), where we consider all combinations of the in-degree of A, the out-degree of A, the in-degree of B, and the out-degree of B.
This is done below:
def degree_assortativity_directed(G: nx.DiGraph):
    """
    Calculate the degree assortativity for a directed graph.

    Parameters:
    - G (nx.DiGraph): A directed graph.

    Returns:
    - r_in_in (float): Degree assortativity for in-degree of nodes.
    - r_out_out (float): Degree assortativity for out-degree of nodes.
    - r_in_out (float): Degree assortativity between in-degree and out-degree of nodes.
    - r_out_in (float): Degree assortativity between out-degree and in-degree of nodes.

    Prerequisites:
    - The input graph G should be a directed graph (nx.DiGraph).
    - The graph G should have nodes and edges defined.
    """
    # k_u and k_v are the degrees of nodes u and v, respectively
    # We collect k_u_in, k_u_out, k_v_in and k_v_out over all edges
    k_u_in = []
    k_u_out = []
    k_v_in = []
    k_v_out = []
    for u, v in G.edges():
        k_u_in.append(G.in_degree(u))
        k_v_in.append(G.in_degree(v))
        k_u_out.append(G.out_degree(u))
        k_v_out.append(G.out_degree(v))
    k_u_in = np.array(k_u_in)
    k_v_in = np.array(k_v_in)
    k_u_out = np.array(k_u_out)
    k_v_out = np.array(k_v_out)
    r_in_in = ((np.mean(k_u_in * k_v_in) - np.mean(k_u_in) * np.mean(k_v_in)) /
               (np.sqrt(np.mean(k_u_in**2) - np.mean(k_u_in)**2) * np.sqrt(np.mean(k_v_in**2) - np.mean(k_v_in)**2)))
    r_out_out = ((np.mean(k_u_out * k_v_out) - np.mean(k_u_out) * np.mean(k_v_out)) /
                 (np.sqrt(np.mean(k_u_out**2) - np.mean(k_u_out)**2) * np.sqrt(np.mean(k_v_out**2) - np.mean(k_v_out)**2)))
    r_in_out = ((np.mean(k_u_in * k_v_out) - np.mean(k_u_in) * np.mean(k_v_out)) /
                (np.sqrt(np.mean(k_u_in**2) - np.mean(k_u_in)**2) * np.sqrt(np.mean(k_v_out**2) - np.mean(k_v_out)**2)))
    r_out_in = ((np.mean(k_u_out * k_v_in) - np.mean(k_u_out) * np.mean(k_v_in)) /
                (np.sqrt(np.mean(k_u_out**2) - np.mean(k_u_out)**2) * np.sqrt(np.mean(k_v_in**2) - np.mean(k_v_in)**2)))
    return r_in_in, r_out_out, r_in_out, r_out_in
r_in_in_real, r_out_out_real, r_in_out_real, r_out_in_real = degree_assortativity_directed(G)
print(f"In-in degree assortativity of the real network: {r_in_in_real:.5f}")
print(f"Out-out degree assortativity of the real network: {r_out_out_real:.5f}")
print(f"In-out degree assortativity of the real network: {r_in_out_real:.5f}")
print(f"Out-in degree assortativity of the real network: {r_out_in_real:.5f}")
In-in degree assortativity of the real network: 0.00140 Out-out degree assortativity of the real network: -0.01516 In-out degree assortativity of the real network: 0.00238 Out-in degree assortativity of the real network: -0.18149
We see that the in-in degree assortativity and the in-out degree assortativity are slightly positive but close to neutral (0).
However, the out-out degree assortativity is slightly more disassortative with a value of -0.015, meaning that packages that depend on many packages have a tendency to be linked with packages that depend on fewer packages (recall that we defined a package as pointing to its dependencies).
Furthermore, we see even stronger signs of disassortativity in the out-in degree assortativity: packages that depend on many packages tend to be linked to packages which few other packages depend on.
We introduce a new way to measure assortativity: Degree correlation. Below we implement an algorithm which calculates the four possible degree correlations.
We define this algorithm to calculate the degree correlation $k_{nn}^{\alpha, \beta}(k)$, where $\alpha$ and $\beta$ refer to in- and out-degrees. It mirrors the undirected degree correlation function: $$k_{nn}(k_i) = \frac{1}{k_i} \sum^{N}_{j=1}{A_{ij}k_j}$$ where $A_{ij}$ is the adjacency matrix.
We use the same calculation but consider the in/out degrees separately, thereby obtaining a family of functions for the different combinations of $\alpha$ (the degree type of node $i$) and $\beta$ (the degree type of its neighbors):
$$k_{nn}^{\alpha, \beta}(k_i^{\alpha}) = \frac{1}{k_i^{\alpha}} \sum^{N}_{j=1}{A_{ij}k_j^{\beta}}$$
This is shown below:
def degree_correlation_directed(G: nx.DiGraph):
    """
    Calculate the degree correlation for a directed graph.

    Parameters:
    - G (nx.DiGraph): A directed graph.

    Returns:
    - k_in_in (dict): A dictionary with the average neighbor in-degree of nodes with in-degree k'.
    - k_in_out (dict): A dictionary with the average neighbor out-degree of nodes with in-degree k'.
    - k_out_in (dict): A dictionary with the average neighbor in-degree of nodes with out-degree k'.
    - k_out_out (dict): A dictionary with the average neighbor out-degree of nodes with out-degree k'.

    Prerequisites:
    - The input graph G should be a directed graph (nx.DiGraph).
    - The graph G should have nodes and edges defined.
    """
    # For each node with degree k' we want the average neighbor in-degree and out-degree
    k_in_in = {}
    k_in_out = {}
    k_out_in = {}
    k_out_out = {}
    for u, v in G.edges():
        if G.in_degree(u) not in k_in_in:
            k_in_in[G.in_degree(u)] = []
        if G.in_degree(u) not in k_in_out:
            k_in_out[G.in_degree(u)] = []
        if G.out_degree(u) not in k_out_in:
            k_out_in[G.out_degree(u)] = []
        if G.out_degree(u) not in k_out_out:
            k_out_out[G.out_degree(u)] = []
    for u, v in G.edges():
        k_in_in[G.in_degree(u)].append(G.in_degree(v))
        k_in_out[G.in_degree(u)].append(G.out_degree(v))
        k_out_in[G.out_degree(u)].append(G.in_degree(v))
        k_out_out[G.out_degree(u)].append(G.out_degree(v))
    k_in_in = {k: np.mean(v) for k, v in k_in_in.items()}
    k_in_out = {k: np.mean(v) for k, v in k_in_out.items()}
    k_out_in = {k: np.mean(v) for k, v in k_out_in.items()}
    k_out_out = {k: np.mean(v) for k, v in k_out_out.items()}
    return k_in_in, k_in_out, k_out_in, k_out_out
k_in_in, k_in_out, k_out_in, k_out_out = degree_correlation_directed(G)
print(f"Average neighbor in-degree of node with in-degree k': {k_in_in}")
print(f"Average neighbor out-degree of node with in-degree k': {k_in_out}")
print(f"Average neighbor in-degree of node with out-degree k': {k_out_in}")
print(f"Average neighbor out-degree of node with out-degree k': {k_out_out}")
Average neighbor in-degree of node with in-degree k': {0: 629.8249263825757, 1: 510.66217303822935, 2: 596.9806835066864, 3: 656.3850129198967, ...}
Average neighbor out-degree of node with in-degree k': {0: 0.6394618768220771, 1: 1.1578470824949698, 2: 0.9435364041604755, 3: 0.8527131782945736, ...}
Average neighbor in-degree of node with out-degree k': {1: 1287.661167200197, 2: 873.2961028525513, 3: 921.5618593142453, 4: 838.2742513020834, ...}
Average neighbor out-degree of node with out-degree k': {1: 0.33907904457030286, 2: 0.4013660104459622, 3: 0.9093319194061505, 4: 0.77880859375, ...}
(output truncated)
We see that the average neighbor in-degree of nodes with in-degree zero is 629.8. This means that packages which are the requirement of no other package generally point to packages which are the requirement of many packages. This makes a lot of sense, as there are some extremely popular packages which many packages have as a requirement, while the depending packages themselves might not be a dependency of any other package.
These results are interesting but it would take a long time to go through all the values. Therefore we've plotted them below:
# The x values are the possible degrees k'
x_values = list(range(0, 1000))
# Get the y values from the values of the dictionaries
y_values_in_in = [k_in_in.get(x, np.nan) for x in x_values]
y_values_in_out = [k_in_out.get(x, np.nan) for x in x_values]
y_values_out_in = [k_out_in.get(x, np.nan) for x in x_values]
y_values_out_out = [k_out_out.get(x, np.nan) for x in x_values]
# Plot the degree correlation
plt.plot(x_values, y_values_in_in, label='In-in degree correlation')
plt.plot(x_values, y_values_out_out, label='Out-out degree correlation')
plt.plot(x_values, y_values_in_out, label='In-out degree correlation')
plt.plot(x_values, y_values_out_in, label='Out-in degree correlation')
plt.xscale('log')
plt.yscale('log')
plt.xlabel("Degree")
plt.ylabel("Average neighbor degree")
plt.legend(loc='lower right')
plt.title("Degree correlation")
plt.savefig(f'{image_folder}Degree_correlation.png', bbox_inches='tight')
plt.show()
The plot displays clear disassortative behavior for the in-out degree correlation and the out-out degree correlation (the negative slope). Also the out-out and out-in degree correlations have some very steep dips.
However, both of these results (directed degree correlation and directed assortativity) should be compared to a random model to check whether they merely display structural patterns or whether the results are significant, i.e. not randomly appearing. Therefore, we have made a degree-preserving randomization (R-S randomization) which follows the three rules set earlier:
- It should not have multilinks
- It should not have self-loops
- It should let the in- and out-degree stay the same during node swapping
This is done below:
# We do a degree preserving randomisation: R-S randomization
def degree_preserving_randomization(G: nx.DiGraph):
    """
    Perform degree-preserving randomization of a directed graph.

    Parameters:
    - G (nx.DiGraph): A directed graph.

    Returns:
    - G_copy (nx.DiGraph): The degree-preserving randomized graph.

    Prerequisites:
    - The input graph G should be a directed graph (nx.DiGraph).
    - The graph G should have nodes and edges defined.
    """
    # Create a copy of the graph
    G_copy = G.copy()
    # Get the edges and indexes
    edges = list(G_copy.edges())
    idxs = list(range(len(edges)))
    # Get the number of swaps to perform
    num_swaps = 10 * G_copy.number_of_edges()
    # For each swap
    for _ in range(num_swaps):
        # Select two edges (u, v) and (x, y)
        idx1, idx2 = random.sample(idxs, 2)
        u, v = edges[idx1]
        x, y = edges[idx2]
        # Ensure distinct nodes (no self-loops)
        if len(set([u, v, x, y])) < 4:
            continue
        # Ensure that the new edges (u, y) and (x, v) do not already exist
        # (for a DiGraph, neighbors() returns the successors)
        if y not in G_copy.neighbors(u) and v not in G_copy.neighbors(x):
            # Remove the old edges
            G_copy.remove_edges_from([(u, v), (x, y)])
            # Add the new edges
            G_copy.add_edges_from([(u, y), (x, v)])
            # Update the edges
            edges[idx1] = (u, y)
            edges[idx2] = (x, v)
    return G_copy
G_degree_randomized = degree_preserving_randomization(G)
k_in_in_randomized, k_in_out_randomized, k_out_in_randomized, k_out_out_randomized = degree_correlation_directed(G_degree_randomized)
We will do this 50 times and display the results on the same plots as before. This way we will see whether the PyPi network is significantly different from the random models and whether the disassortativity is not only due to structural disassortativity.
# We collect the values created by the R-S randomized graphs
r_in_in_R_S = []
r_out_out_R_S = []
r_in_out_R_S = []
r_out_in_R_S = []
y_values_in_in_randomized = []
y_values_out_out_randomized = []
y_values_in_out_randomized = []
y_values_out_in_randomized = []
for _ in tqdm(range(50)):
    G_r_s = degree_preserving_randomization(G)
    r_in_in, r_out_out, r_in_out, r_out_in = degree_assortativity_directed(G_r_s)
    r_in_in_R_S.append(r_in_in)
    r_out_out_R_S.append(r_out_out)
    r_in_out_R_S.append(r_in_out)
    r_out_in_R_S.append(r_out_in)
    k_in_in_randomized, k_in_out_randomized, k_out_in_randomized, k_out_out_randomized = degree_correlation_directed(G_r_s)
    y_values_in_in_randomized.append([k_in_in_randomized.get(x, np.nan) for x in x_values])
    y_values_in_out_randomized.append([k_in_out_randomized.get(x, np.nan) for x in x_values])
    y_values_out_in_randomized.append([k_out_in_randomized.get(x, np.nan) for x in x_values])
    y_values_out_out_randomized.append([k_out_out_randomized.get(x, np.nan) for x in x_values])
# Get the mean
y_values_in_in_randomized_mean = np.mean(y_values_in_in_randomized, axis=0)
y_values_out_out_randomized_mean = np.mean(y_values_out_out_randomized, axis=0)
y_values_in_out_randomized_mean = np.mean(y_values_in_out_randomized, axis=0)
y_values_out_in_randomized_mean = np.mean(y_values_out_in_randomized, axis=0)
# Get the std
y_values_in_in_randomized_std = np.std(y_values_in_in_randomized, axis=0)
y_values_out_out_randomized_std = np.std(y_values_out_out_randomized, axis=0)
y_values_in_out_randomized_std = np.std(y_values_in_out_randomized, axis=0)
y_values_out_in_randomized_std = np.std(y_values_out_in_randomized, axis=0)
100%|ββββββββββ| 50/50 [20:42<00:00, 24.86s/it]
We show the degree correlation together with the mean over the 50 random networks and a band of 2 standard deviations above and below the mean:
# Get the y values from the values of the dictionaries for the randomized network
# Plot the degree correlation
plt.figure(figsize=(12, 10))
plt.plot(x_values, y_values_in_in, label='In-in degree correlation', color='blue', marker = 'o')
plt.plot(x_values, y_values_out_out, label='Out-out degree correlation', color='orange', marker='o')
plt.plot(x_values, y_values_in_out, label='In-out degree correlation', color='green', marker='o')
plt.plot(x_values, y_values_out_in, label='Out-in degree correlation', color='red', marker='o')
plt.plot(x_values, y_values_in_in_randomized_mean, label='In-in degree correlation randomized', linestyle='None', marker = 'D', color = 'blue', markerfacecolor = 'None')
plt.plot(x_values, y_values_out_out_randomized_mean, label='Out-out degree correlation randomized', linestyle='None', marker = 'D', color = 'orange', markerfacecolor = 'None')
plt.plot(x_values, y_values_in_out_randomized_mean, label='In-out degree correlation randomized', linestyle='None', marker = 'D', color = 'green', markerfacecolor = 'None')
plt.plot(x_values, y_values_out_in_randomized_mean, label='Out-in degree correlation randomized', linestyle='None', marker = 'D', color = 'red', markerfacecolor = 'None')
lower_bound_in_in, upper_bound_in_in = y_values_in_in_randomized_mean - y_values_in_in_randomized_std*2, y_values_in_in_randomized_mean + y_values_in_in_randomized_std*2
lower_bound_out_out, upper_bound_out_out = y_values_out_out_randomized_mean - y_values_out_out_randomized_std*2, y_values_out_out_randomized_mean + y_values_out_out_randomized_std*2
lower_bound_in_out, upper_bound_in_out = y_values_in_out_randomized_mean - y_values_in_out_randomized_std*2, y_values_in_out_randomized_mean + y_values_in_out_randomized_std*2
lower_bound_out_in, upper_bound_out_in = y_values_out_in_randomized_mean - y_values_out_in_randomized_std*2, y_values_out_in_randomized_mean + y_values_out_in_randomized_std*2
plt.fill_between(x_values, lower_bound_in_in, upper_bound_in_in, color='blue', alpha=0.4)
plt.fill_between(x_values, lower_bound_out_out, upper_bound_out_out, color='orange', alpha=0.4)
plt.fill_between(x_values, lower_bound_in_out, upper_bound_in_out, color='green', alpha=0.4)
plt.fill_between(x_values, lower_bound_out_in, upper_bound_out_in, color='red', alpha=0.4)
plt.xscale('log')
plt.yscale('log')
plt.xlabel("Degree")
plt.ylabel("Average neighbor degree")
plt.legend(loc='lower right')
plt.title("Degree correlation")
plt.savefig(f'{image_folder}Degree_correlation_randomized.png', bbox_inches='tight')
plt.show()
This plot also shows clear disassortative behavior for the in-out degree correlation. However, the in-out curve for the random models is much flatter and closer to linear, which indicates that the disassortativity in the PyPi network is not merely structural. The random networks do span a very large range for the out-out degree correlation, so it is hard to say anything definitive based on that alone. Generally, though, the values of the real network fall outside the range of the random models.
Below we plot the assortativities:
# Make the same plots with the degree preserving randomization
# Plot the distribution of the assortativities
plt.hist(r_in_in_R_S, bins=10, label='In-in degree assortativity', color='red')
plt.hist(r_out_out_R_S, bins=10, label='Out-out degree assortativity', color='green')
plt.hist(r_in_out_R_S, bins=10, label='In-out degree assortativity', color='blue')
plt.hist(r_out_in_R_S, bins=10, label='Out-in degree assortativity', color='orange')
# Plot the assortativity of the original network
plt.axvline(r_in_in_real, color='red', linestyle='--', label='Real network In-in')
plt.axvline(r_out_out_real, color='green', linestyle='--', label='Real network Out-out')
plt.axvline(r_in_out_real, color='blue', linestyle='--', label='Real network In-out')
plt.axvline(r_out_in_real, color='orange', linestyle='--', label='Real network Out-in')
plt.xlabel('Assortativity')
plt.ylabel('Frequency')
plt.title('Assortativity of the random networks (R-S randomization)')
plt.legend()
plt.savefig(f'{image_folder}Assortativity_column_R_S.png', bbox_inches='tight')
plt.show()
Again we see that especially the out-in assortativity differs from the random models, but the out-out assortativity also seems to differ from that of a random network.
A final plot gives a different visualisation of the degree differences between the real and random networks.
# Instead plot it as a line plot where the x_values are in_in, in_out, out_in, out_out:
x_values = ['In-in', 'In-out', 'Out-in', 'Out-out']
# Get the y values from the values of the dictionaries
y_values = [r_in_in_real, r_in_out_real, r_out_in_real, r_out_out_real]
fig = plt.figure()
plt.plot(x_values, y_values, label='Real network')
for i, (r_in_in, r_out_out, r_in_out, r_out_in) in enumerate(zip(r_in_in_R_S, r_out_out_R_S, r_in_out_R_S, r_out_in_R_S)):
y_values = [r_in_in, r_in_out, r_out_in, r_out_out]
plt.plot(x_values, y_values, alpha=0.25, color='red')
plt.xlabel("Degree correlation")
plt.ylabel("Assortativity")
plt.title("Assortativity of the random networks (R-S randomization)")
plt.legend()
plt.savefig(f'{image_folder}Assortativity_lineplot_R_S.png', bbox_inches='tight')
plt.show()
Again the out-in degree is especially different from the random networks, which are shown as the overlapping red lines in the plot above. The out-out degree also looks significantly different.
Part 1 of Conclusion on the Graph Analysis
The network of PyPi packages is not random, and it shows very clear signs of disassortativity, specifically for the out-in degrees and, slightly less, for the out-out degrees. This indicates that packages which have a lot of packages depending on them do not depend on many packages themselves.
We also want to investigate which packages are most central in this graph; that is what the rest of the graph analysis covers. First we find the most central packages according to closeness centrality and eigenvector centrality:
# Find the 5 most central packages according to closeness centrality.
closeness_centrality = nx.closeness_centrality(G)
sorted_closeness_centrality = sorted(closeness_centrality.items(), key=lambda x: x[1], reverse=True)
print(f'The 5 most central packages according to closeness centrality are: {[str(package) for package, centrality in sorted_closeness_centrality[:5]]}')
# Find the 5 most central packages according to eigenvector centrality.
eigenvector_centrality = nx.eigenvector_centrality(G)
sorted_eigenvector_centrality = sorted(eigenvector_centrality.items(), key=lambda x: x[1], reverse=True)
print(f'The 5 most central packages according to eigenvector centrality are: {[str(package) for package, centrality in sorted_eigenvector_centrality[:5]]}')
The 5 most central packages according to closeness centrality are: ['pytest', 'requests', 'numpy', 'pytest-cov', 'pandas'] The 5 most central packages according to eigenvector centrality are: ['dissect.cstruct', 'dissect.target', 'dissect.archive', 'dissect.btrfs', 'dissect.cim']
It makes a lot of sense that pytest, requests, numpy, pytest-cov and pandas are the most central packages according to closeness centrality, as these are extremely popular and commonly used packages, especially with the increased popularity of machine learning.
It also makes sense that the dissect packages are the most central according to eigenvector centrality: they are part of a 'dissect' family of packages which all depend on each other, so they score high on this measure.
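This behaviour is easy to reproduce on a toy graph: a small clique whose members only link to each other can out-score a star hub with a higher degree, because eigenvector centrality rewards having high-scoring neighbours. Below is a minimal power-iteration sketch in plain Python (an illustration, not the NetworkX call used above):

```python
def eigenvector_centrality(adj, iters=200):
    """Power iteration on an adjacency list (dict node -> list of neighbors)."""
    nodes = sorted(adj)
    x = {v: 1.0 for v in nodes}
    for _ in range(iters):
        x_new = {v: sum(x[u] for u in adj[v]) for v in nodes}
        norm = max(x_new.values()) or 1.0
        x = {v: s / norm for v, s in x_new.items()}
    return x

# A 6-leaf star (hub = node 0) alongside a 4-clique (nodes 7-10)
adj = {0: [1, 2, 3, 4, 5, 6]}
for leaf in range(1, 7):
    adj[leaf] = [0]
clique = [7, 8, 9, 10]
for v in clique:
    adj[v] = [u for u in clique if u != v]

x = eigenvector_centrality(adj)
# The clique dominates even though the hub has the higher degree
assert all(x[v] > x[0] for v in clique)
```

The 4-clique's largest eigenvalue (3) beats the 6-leaf star's (√6 ≈ 2.45), so the clique absorbs essentially all of the centrality mass, just like the dissect family does in the real network.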
closeness_centrality = nx.closeness_centrality(G)
degree = dict(G.degree())
plt.scatter(list(closeness_centrality.values()), list(degree.values()))
plt.xlabel('Closeness centrality')
plt.ylabel('Degree')
plt.title('Closeness centrality vs. Degree')
plt.savefig(f'{image_folder}Closeness_vs_Degree.png', bbox_inches='tight')
plt.show()
Here we see a linear relationship between closeness centrality and degree, so the nodes with the highest degree can 'reach' the most nodes.
eigenvector_centrality = nx.eigenvector_centrality(G)
degree = dict(G.degree())
plt.scatter(list(eigenvector_centrality.values()), list(degree.values()))
plt.xlabel('Eigenvector centrality')
plt.ylabel('Degree')
plt.title('Eigenvector centrality vs. Degree')
plt.savefig(f'{image_folder}Eigenvector_vs_Degree.png', bbox_inches='tight')
plt.show()
On the plot above we see that most of the packages don't have 'important' neighbors. Almost all of the nodes, no matter their degree, have 'unimportant' neighbors, apart from the dissect cluster from before.
We can also simply plot the most used packages and the packages with the most dependencies:
#Plot of the 20 packages with the highest out-degrees and 20 packages with the highest in degrees:
# Get the out-degrees and in-degrees
out_degrees = dict(G.out_degree())
in_degrees = dict(G.in_degree())
# Sort them
sorted_out_degrees = sorted(out_degrees.items(), key=lambda x: x[1], reverse=True)
sorted_in_degrees = sorted(in_degrees.items(), key=lambda x: x[1], reverse=True)
# Get the top 20 packages
top_out_degrees = sorted_out_degrees[:20]
top_in_degrees = sorted_in_degrees[:20]
# Bar plot of the degrees
plt.figure(figsize=(12, 6))
plt.bar([package for package, degree in top_out_degrees], [degree for package, degree in top_out_degrees], label='Out-degree')
plt.bar([package for package, degree in top_in_degrees], [degree for package, degree in top_in_degrees], label='In-degree')
plt.xticks(rotation=90)
plt.xlabel('Package')
plt.ylabel('Degree')
plt.title('Top 20 packages with the highest out-degrees and in-degrees')
plt.legend()
plt.gca().yaxis.grid(True)
plt.savefig(f'{image_folder}Top_20_degrees.png', bbox_inches='tight')
plt.show()
As expected, we see some very well-known packages that are often dependencies. The packages with the most dependencies are, however, much less well known.
Finally, for the graph analysis, we plot the top 10 packages by total degree count. This plot is purposely made to resemble Kevin Gullikson's so the two can be compared.
# Get the degrees - here using the undirected network to only count total degrees
degrees = dict(G_undir.degree())
# Sort them
sorted_degrees = sorted(degrees.items(), key=lambda x: x[1], reverse=True)
# Get the top 10 packages
top_degrees = sorted_degrees[:10]
# Bar plot of the degrees
plt.figure(figsize=(3, 6))
plt.barh([package for package, _ in top_degrees[::-1]], [degree for _, degree in top_degrees[::-1]])
plt.xticks(list(range(0, 4001, 500)))
plt.xticks(rotation=90)
plt.xlabel('Degree')
plt.ylabel('Package')
plt.title('Top 10 packages with the highest degrees')
plt.gca().xaxis.grid(True)
plt.savefig(f'{image_folder}Top_10_degrees.png', bbox_inches='tight')
plt.show()
Comparing the above plot to the one in the blog linked earlier shows that there have definitely been changes in which packages have the highest degree counts, while other packages have stagnated. Especially numpy, pandas and matplotlib have increased drastically; for numpy the increase is approximately 200%, while a package like six has barely changed at all. Another side note is the package peppercorn: it is the only requirement of a project called sampleproject, and unfortunately many packages link to this repository on their PyPi page. We still included the package for the additional relevance it could have for the degree distribution, but otherwise it can be ignored.
Part 2 of Conclusion on Graph Analysis
The centrality analysis has given us the understanding that there are some important hubs (numpy, pytest, pandas) which are dependencies for many packages, and that these important packages are mostly used by many unimportant packages.
Together with the earlier graph analysis, we have found that the PyPi package network is not random and tends to be disassortative. This tells us that there exist many packages in the Python Package Index (PyPi) which use central packages but are not themselves used by any other package.
Textual analysis
The next step of the analysis is about uncovering communities within the network and trying to see if it is possible to characterize them. Our initial hypothesis was that within the network we might see substructures that relate to different aspects of programming, such as machine learning, web development, scientific computing, etc. We also want to see if we can use the textual information in the README files to support this analysis.
To partition the network into communities we use the Louvain method, since it works well for large networks. The Louvain method is a greedy optimization approach for finding communities: it locally optimizes modularity to form small communities and iteratively merges them until the gain in modularity falls below a small threshold. We use the NetworkX implementation of the Louvain method, as the structure it returns can be passed directly to NetworkX's own modularity function.
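The modularity being optimized compares, for each community, the fraction of edges that fall inside it with the fraction expected in a random graph with the same degrees: Q = Σ_c [ e_c/m - (d_c/2m)^2 ]. It can be computed by hand for a toy graph; here is a small stdlib sketch, independent of the NetworkX implementation used below:

```python
def modularity(edges, communities):
    """Q = sum_c [ e_c/m - (d_c / 2m)^2 ] for an undirected edge list."""
    m = len(edges)
    comm_of = {v: i for i, c in enumerate(communities) for v in c}
    internal = [0] * len(communities)    # edges with both ends in community c
    degree_sum = [0] * len(communities)  # total degree of community c
    for u, v in edges:
        degree_sum[comm_of[u]] += 1
        degree_sum[comm_of[v]] += 1
        if comm_of[u] == comm_of[v]:
            internal[comm_of[u]] += 1
    return sum(ec / m - (dc / (2 * m)) ** 2
               for ec, dc in zip(internal, degree_sum))

# Two triangles joined by a single bridge edge
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
q = modularity(edges, [{0, 1, 2}, {3, 4, 5}])
# Q = 2 * (3/7 - (7/14)**2) = 5/14, about 0.357
assert abs(q - 5 / 14) < 1e-9
```

Putting each triangle in its own community scores well because most edges are internal; merging everything into one community would give Q = 0.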
partition = nx.community.louvain_communities(G_undir)
print(f"There is a total of {len(partition)} communities in the network.")
# Order the partition by size
partition = sorted(partition, key=len, reverse=True)
print({i: len([v for v in p]) for i, p in enumerate(partition)})
print(f"Modularity of network: {nx.community.modularity(G_undir, partition):.4f}")
# Save the partition to a pickle file
with open("data/partition.pkl", "wb") as f:
pickle.dump(partition, f)
There is a total of 78 communities in the network.
{0: 6573, 1: 5619, 2: 3395, 3: 2610, 4: 2268, 5: 1703, 6: 1602, 7: 1502, 8: 1202, 9: 941, 10: 889, 11: 731, 12: 347, 13: 310, 14: 301, 15: 267, 16: 208, 17: 120, 18: 99, 19: 75, 20: 69, 21: 36, 22: 34, 23: 32, 24: 22, 25: 17, 26: 16, 27: 16, 28: 14, 29: 14, 30: 13, 31: 13, 32: 13, 33: 12, 34: 11, 35: 9, 36: 9, 37: 8, 38: 8, 39: 7, 40: 7, 41: 7, 42: 7, 43: 6, 44: 6, 45: 6, 46: 6, 47: 6, 48: 6, 49: 6, 50: 6, 51: 5, 52: 5, 53: 5, 54: 5, 55: 4, 56: 4, 57: 4, 58: 4, 59: 4, 60: 4, 61: 4, 62: 4, 63: 4, 64: 4, 65: 4, 66: 3, 67: 3, 68: 3, 69: 3, 70: 3, 71: 3, 72: 3, 73: 3, 74: 3, 75: 3, 76: 3, 77: 3}
Modularity of network: 0.5237
Since the Louvain method is a greedy heuristic approach for finding communities, and thereby non-deterministic, we do not get the same number of communities each time the algorithm is run. However, each run finds between 70 and 100 communities, with a modularity of around 0.52, indicating that the partitioning into communities is reasonably good. Furthermore, the algorithm finds a couple of very large communities with thousands of nodes, some medium-sized communities with hundreds of nodes, and then many small communities. For the further analysis we will mainly look at the largest communities.
# Load the partition
with open("data/partition.pkl", "rb") as f:
partition = pickle.load(f)
# Create a dictionary mapping each node to its partition
node_partitions = {node: i for i, p in enumerate(partition) for node in p}
# Show packages in the top 10 communities
top_10_communities = {i: len([v for v in p]) for i, p in enumerate(partition)}
top_10_communities = dict(sorted(top_10_communities.items(), key=lambda x: x[1], reverse=True)[:10])
# Get the 10 packages with highest degree count in each community
top_10_packages = {}
degrees = dict(G_undir.degree())
for community_idx in top_10_communities:
community_nodes_in = [node for node in G.nodes if node_partitions[node] == community_idx]
degrees_community = {node: degrees[node] for node in community_nodes_in}
top_10_packages[community_idx] = sorted(degrees_community, key=degrees_community.get, reverse=True)[:10]
# Print the top 10 communities and the top 10 packages in them
for community_idx, packages in top_10_packages.items():
print(f'Community {community_idx}:')
print(', '.join(packages))
print()
Community 0: numpy, pandas, matplotlib, scipy, tqdm, pillow, scikit-learn, pip, torch, python Community 1: pytest, pytest-cov, coverage, flake8, twine, setuptools, black, sphinx, wheel, mypy Community 2: requests, beautifulsoup4, lxml, aiohttp, plotly, bokeh, selenium, bs4, mpld3, docopt Community 3: six, click, urllib3, certifi, idna, packaging, jinja2, pygments, docutils, pyparsing Community 4: pytz, ipython, psutil, foldamers, darkneurons, dxc-rl, neural-mesh-model, openbb-terminal-nightly, jsonschema, deepgnn-torch Community 5: pydantic, boto3, sqlalchemy, python-dotenv, fastapi, httpx, uvicorn, typer, openai, pymongo Community 6: peppercorn, overwatch-api-m, erydex, example-pkg-rdeeban, adj-dataparrots-5, ufwio, bd-nineaxle-matrix, dsmvlib, happy2torch, gym-simplifiedtetris-avela Community 7: python-dateutil, cryptography, openupgradelib, unidecode, openpyxl, xlrd, freezegun, pyjwt, dataclasses, mako Community 8: django, typing_extensions, redis, xmltodict, djangorestframework, indico, psycopg2-binary, html5lib, async-timeout, sqlparse Community 9: flask, enview, gui-for-speedtest-fast, synapse-fm, fsspec, paramiko, squadco, swiggy-analytics, creavel, superset-d2
Here we can clearly see some of the biggest packages in the top 10 communities. The first community is heavily geared towards machine learning and data mining. We also have community 2, with packages for collecting information from websites such as beautifulsoup4 and requests, and some web-related packages such as flask in community 9.
We tokenize all the words in the README files, removing stopwords and punctuation and stemming the remaining words.
df = pd.read_csv("data/nodes_data.csv")
df = df.dropna(subset=['readme_text']) # Drop rows with missing readme text
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))
def tokenize(text, stop_words=stop_words, stemmer=stemmer):
    """
    Tokenize the text by removing stopwords and punctuation and stemming the words.

    Parameters:
    - text (str): The input text.
    - stop_words (set): A set of stopwords.
    - stemmer (PorterStemmer): A stemmer object.

    Returns:
    - tokens (list): A list of tokens.
    """
    if text is None or text == '':
        return []
    tokens = word_tokenize(text)
    tokens = [word.lower() for word in tokens if word.isalpha()]
    tokens = [word for word in tokens if word not in stop_words]
    tokens = [stemmer.stem(word) for word in tokens]
    return tokens
# Apply the function to the readme's
df["tokens"] = df["readme_text"].apply(tokenize)
# Combine the tokens from all abstracts into one comprehensive list.
all_tokens = df["tokens"].sum()
However, some word pairs might be collocations. Therefore, if a pair of words appears more than 50 times and has a p-value below 0.001, we treat it as a collocation instead of two separate tokens.
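To make the test concrete: for a candidate bigram one builds a 2x2 contingency table of co-occurrence counts and converts the chi-squared statistic into a p-value. For one degree of freedom the survival function has the closed form `erfc(sqrt(x / 2))`, so this can be done with the stdlib alone. A minimal sketch with made-up counts:

```python
import math

def bigram_p_value(n_ii, n_io, n_oi, n_oo):
    """Chi-squared p-value (df=1) for a bigram contingency table.

    n_ii: count of the bigram (w1, w2); n_io: w1 without w2;
    n_oi: w2 without w1; n_oo: neither word.
    """
    n = n_ii + n_io + n_oi + n_oo
    row1, row2 = n_ii + n_io, n_oi + n_oo
    col1, col2 = n_ii + n_oi, n_io + n_oo
    chi2 = 0.0
    for obs, r, c in [(n_ii, row1, col1), (n_io, row1, col2),
                      (n_oi, row2, col1), (n_oo, row2, col2)]:
        expected = r * c / n
        chi2 += (obs - expected) ** 2 / expected
    # For df=1, scipy's chi2.sf(x, 1) equals erfc(sqrt(x / 2))
    return math.erfc(math.sqrt(chi2 / 2))

# Hypothetical counts: the bigram occurs 80 times in a 100k-token corpus,
# far more often than the marginal counts of its two words would predict
p = bigram_p_value(n_ii=80, n_io=920, n_oi=420, n_oo=98580)
assert p < 0.001
```

When the observed count matches the independence expectation the statistic is 0 and the p-value is 1; strongly co-occurring pairs get vanishingly small p-values and pass the filter.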
# Bigrams and contingency tables
########## THIS IMPLEMENTATION IS OUR TAKE HOWEVER IT WAS TOO SLOW #############
# bigram_measures = BigramAssocMeasures()
# finder = BigramCollocationFinder.from_words(all_tokens)
# bigram_scores = finder.score_ngrams(bigram_measures.likelihood_ratio)
# for bigram, score in bigram_scores:
# n_ii = bigram_counts[bigram]
# n_io = tokens_no_stopwords.count(bigram[0]) - n_ii
# n_oi = tokens_no_stopwords.count(bigram[1]) - n_ii
# n_oo = len(tokens_no_stopwords) - n_ii - n_io - n_oi
# R1 = n_ii + n_io
# C1 = n_ii + n_oi
# R2 = n_oi + n_oo
# C2 = n_io + n_oo
# N = R1 + C1 + R2 + C2
# E_ii = (R1 * C1) / N
# E_io = (R1 * C2) / N
# E_oi = (R2 * C1) / N
# E_oo = (R2 * C2) / N
# chi_squared = ((n_ii - E_ii)**2 / E_ii) + ((n_io - E_io)**2 / E_io) + ((n_oi - E_oi)**2 / E_oi) + ((n_oo - E_oo)**2 / E_oo)
# p_value = stats.chi2.sf(chi_squared, 1)
# print(f'Bigram: {bigram}')
# print(f'P-value: {p_value}')
# print('-------------------')
######### THEREFORE WE USE THE NLTK IMPLEMENTATION WHICH IS FAST ######################
# Function that creates bigrams from a list of tokens
bigrams = list(make_bigrams(all_tokens))
bigram_counts = Counter(bigrams)
# Function that creates a contingency table for a bigram
# Create a BigramCollocationFinder object that also stores tables at the same time
finder = BigramCollocationFinder.from_words(all_tokens)
# Computes for each unique bigram the chi-squared statistic between the observed and expected contingency tables
# Filter bigrams based on chi-squared
collocations_chisq = finder.score_ngrams(BigramAssocMeasures().chi_sq)
# Compute p-values for the chi-squared statistics
# Calculate chi-squared to get p-value
collocations_p = []
for bigram, chi_squared in tqdm(collocations_chisq):
collocations_p.append((bigram, stats.chi2.sf(chi_squared, 1)))
# Find the list of bigrams that appear more than 50 times and have p-value smaller than 0.001
# Filter collocations based on p-value
collocations = [bigram for bigram, p_value in collocations_p if bigram_counts[bigram] > 50 and p_value < 0.001]
# Show the top 20 collocations (score_ngrams orders them by chi-squared score)
print(f'Top 20 collocations by chi-squared score: {collocations[:20]}')
# Combine collocations into a single token
collocation_tokens = ['_'.join(bigram) for bigram in collocations]
100%|██████████| 2416327/2416327 [02:39<00:00, 15164.36it/s]
Top 20 collocations by chi-squared score: [('auditlogauditl', 'jsonifierjsonifi'), ('endpointendpoint', 'tencentcloudapicomcvm'), ('hollinworth', 'yoan'), ('jsonifierjsonifi', 'sentrys'), ('secretidsecretkey', 'endpointendpoint'), ('uaio', 'uargpars'), ('umqttrobust', 'umqttsimpl'), ('restructuredtextrst', 'markdownmd'), ('yoan', 'blanc'), ('tencentcloudapicomcvm', 'cvmtencentcloudapicom'), ('ashton', 'honneck'), ('benesch', 'maksym'), ('kostya', 'leschenko'), ('leschenko', 'ashton'), ('maksym', 'balatsko'), ('blanc', 'kostya'), ('bratteng', 'nikhil'), ('nikhil', 'benesch'), ('carey', 'bratteng'), ('edmund', 'crosley')]
An interesting observation is that a lot of the collocations found are the names of prolific GitHub developers, such as Edmund Crosley or Maksym Balatsko.
print(f'Number of collocations: {len(collocation_tokens)}')
Number of collocations: 13581
We use the collocations and regular tokens to tokenize the texts again:
def tokenize_with_collocations(text: str, collocations=collocation_tokens, stop_words=stop_words, stemmer=stemmer):
    """
    Tokenize the text by removing stopwords and punctuation, stemming the words,
    and merging known collocations into single tokens.

    Parameters:
    - text (str): The input text.
    - collocations (list): A list of collocations.
    - stop_words (set): A set of stopwords.
    - stemmer (PorterStemmer): A stemmer object.

    Returns:
    - tokens (list): A list of tokens.
    """
    if text is None or text == '':
        return []
    tokens = word_tokenize(text)
    tokens = [word.lower() for word in tokens if word.isalpha()]
    tokens = [stemmer.stem(word) for word in tokens]
    # Remove stopwords
    tokens = [word for word in tokens if word not in stop_words]
    # Check all tokens and replace collocations with a single token
    for i in range(len(tokens) - 1):
        if f'{tokens[i]}_{tokens[i + 1]}' in collocations:
            tokens[i] = f'{tokens[i]}_{tokens[i + 1]}'
            tokens[i + 1] = ''
    tokens = [word for word in tokens if word != '']
    return tokens
# Apply the function to the readme's
df["collocations"] = df["readme_text"].apply(tokenize_with_collocations)
# Use the partitioning of the packages to get a column with each package's community in the df:
df['partition'] = df['package'].map(node_partitions)
# Save the dataframe to a csv file
df.to_csv("data/nodes_tokens.csv", index=False)
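One practical note on the collocation merging above: `collocation_tokens` is a list of roughly 13k strings, so the `in` check inside the loop scans the whole list for every token pair. Converting the list to a set makes each lookup O(1). A minimal sketch of the merging step with a set (the token values here are hypothetical):

```python
def merge_collocations(tokens, collocations):
    """Merge adjacent token pairs that form a known collocation.

    `collocations` should be a set of 'word1_word2' strings so the
    membership check is O(1) instead of scanning a long list.
    """
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and f'{tokens[i]}_{tokens[i + 1]}' in collocations:
            merged.append(f'{tokens[i]}_{tokens[i + 1]}')
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

collocations = {'machin_learn', 'open_sourc'}   # hypothetical stemmed bigrams
tokens = ['fast', 'open', 'sourc', 'machin', 'learn', 'librari']
print(merge_collocations(tokens, collocations))
# ['fast', 'open_sourc', 'machin_learn', 'librari']
```

With ~13k collocations and millions of tokens, this change alone can cut the tokenization pass from hours to minutes.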
We combine the communities found with the tokenized words belonging to the packages in each community. This way we can generate a word cloud for the 10 biggest communities and also show each community's biggest packages.
# Read the dataframe from the csv file
df = pd.read_csv("data/nodes_tokens.csv")
# Select the top 10 communities
top_10_communities = df['partition'].value_counts().index[:10]
# Filter the dataframe on the top 10 communities
df_top_10 = df[df['partition'].isin(top_10_communities)]
# Create a dictionary with the top 10 communities and the corresponding tokens
top_10_community_terms = {}
for community in top_10_communities:
# The tokens are a long string but should be read as individual lists
tokens = df_top_10[df_top_10['partition'] == community]['collocations'].apply(ast.literal_eval)
tokens = [token for sublist in tokens for token in sublist]
# only use term frequency
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform([' '.join(tokens)])
terms = vectorizer.get_feature_names_out()
top_10_community_terms[community] = terms
# Get the 3 packages with the highest in-degree in each community
top_3_packages = {}
for community in top_10_communities:
in_degrees = dict(G.in_degree())
community_nodes = [node for node in G.nodes if node_partitions[node] == community]
in_degrees_community = {node: in_degrees[node] for node in community_nodes}
top_3_packages[community] = sorted(in_degrees_community, key=in_degrees_community.get, reverse=True)[:3]
top_10_communities
Index([0.0, 1.0, 2.0, 3.0, 7.0, 6.0, 5.0, 4.0, 8.0, 10.0], dtype='float64', name='partition')
Below the word clouds for the top 10 communities are shown:
# Create a wordcloud with the information on the communities:
for community in top_10_communities:
terms = top_10_community_terms[community]
text = ' '.join(terms)
wordcloud = wc.WordCloud(width=800, height=400).generate(text)
plt.figure(figsize=(10, 10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.text(700, -30, f'Community {community}', fontsize=14, color='black', ha='center')
plt.text(0, -50, f'Top 3 packages: {top_3_packages[community][0]}', fontsize=12, color='black', zorder=1)
plt.text(0, -30, f'{top_3_packages[community][1]}', fontsize=12, color='black', zorder=2)
plt.text(0, -10, f'{top_3_packages[community][2]}', fontsize=12, color='black', zorder=3)
plt.savefig(f'{image_folder}Community_{community}.png', bbox_inches='tight')
plt.show()
We see that some communities, such as the machine learning community with numpy, pandas and matplotlib, are still recognizable, but in many cases the words in the word clouds are not very distinctive, and some words are common across multiple different communities. Searching for a word like 'alia' in the csv files shows around 5,000 occurrences, so it is a commonly used term, but its meaning does not seem specific to any particular community. The stopword list might therefore need to be tuned towards words that are common within the coding community and its particular language. We do also see other words appearing in the word clouds, like 'ai', 'databa', 'clean', 'cite', 'conver', 'impo' and 'code', which do carry meaning in a coding context. However, to better preserve the meaning of the words, the stemming might also need to be done differently and more domain-specifically, as with the stopwords.
Discussion
The network analysis had some very positive results, as it showed that the PyPi network is meaningful and gave some insight into how some packages carry much more importance than others; see the closeness centrality plot and the plots on degree correlation.
The textual analysis showed that some words were specific to certain packages; however, it was hard to interpret their exact meaning. We expected some of the word clouds to be centered around specific, recognizable themes, but they mostly contained software developer jargon.
Furthermore, investigating potential reasons behind the observed disassortativity in the PyPi network, or trying alternative methods for community detection, could have enhanced the depth of our analysis. A lot more time could also be spent on understanding the word clouds, since there is certainly some insight to be gained by digging into some of the more obscure words.
There are also some pretty apparent biases introduced in the data collection. The PyPi graph we produced only contains packages with a GitHub repository that had some kind of predetermined dependency specification (requirements-dev.txt, dev-requirements.txt, environment.yml, pyproject.toml or requirements.txt on a main or master branch), while some projects without such a file of course still depend on other packages. This bias is substantial, as we go from over 450k packages to just 32k nodes, less than 10%. However, this was necessary for the data collection to be feasible. Furthermore, we argue that a package without such a specification might not be that 'important', as anybody can release packages onto PyPi, and more 'hobby' projects would most likely not contain a dependency specification. The same argument goes for the README files.